A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Audio-visual source localization is a challenging task that aims to predict
the location of visual sound sources in a video. Since collecting ground-truth
annotations of sounding objects can be costly, a plethora of weakly-supervised
localization methods that can learn from datasets with no bounding-box
annotations have been proposed in recent years, by leveraging the natural
co-occurrence of audio and visual signals. Despite significant interest,
popular evaluation protocols have two major flaws. First, they allow for the
use of a fully annotated dataset to perform early stopping, thus significantly
increasing the annotation effort required for training. Second, current
evaluation metrics assume the presence of sound sources at all times. This is
of course an unrealistic assumption, and thus better metrics are necessary to
capture the model's performance on (negative) samples with no visible sound
sources. To accomplish this, we extend the test set of popular benchmarks,
Flickr SoundNet and VGG-Sound Sources, in order to include negative samples,
and measure performance using metrics that balance localization accuracy and
recall. Using the new protocol, we conducted an extensive evaluation of prior
methods, and found that most prior works cannot identify negatives and suffer
from significant overfitting, relying heavily on early stopping for their best
results. We also propose a new approach for visual
sound source localization that addresses both these problems. In particular, we
found that, through extreme visual dropout and the use of momentum encoders,
the proposed approach combats overfitting effectively, and establishes a new
state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code
and pre-trained models are available at https://github.com/stoneMo/SLAVC.
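The momentum encoders mentioned above are, in many self-supervised methods, maintained as an exponential moving average (EMA) of a student network's weights; the sketch below illustrates that generic update rule, not the authors' implementation (function name and parameters are illustrative):

```python
def momentum_update(student_params, teacher_params, m=0.999):
    """EMA update used by momentum encoders: the teacher's weights
    slowly track the student's, which stabilizes learning targets."""
    return [m * t + (1.0 - m) * s
            for s, t in zip(student_params, teacher_params)]
```

With m close to 1, the teacher changes only slightly per step, which is what makes its outputs a stable target for the student.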
Tree of Uncertain Thoughts Reasoning for Large Language Models
While the recently introduced Tree of Thoughts (ToT) has heralded
advancements in allowing Large Language Models (LLMs) to reason through
foresight and backtracking for global decision-making, it has overlooked the
inherent local uncertainties in intermediate decision points or "thoughts".
These local uncertainties, intrinsic to LLMs given their potential for diverse
responses, remain a significant concern in the reasoning process. Addressing
this pivotal gap, we introduce the Tree of Uncertain Thoughts (TouT), a
reasoning framework tailored for LLMs. Our TouT effectively leverages Monte
Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse
local responses at these intermediate steps. By marrying this local uncertainty
quantification with global search algorithms, TouT enhances the model's
precision in response generation. We substantiate our approach with rigorous
experiments on two demanding planning tasks: Game of 24 and Mini Crosswords.
The empirical evidence underscores TouT's superiority over both ToT and
chain-of-thought prompting methods.
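As a rough illustration of the idea (not the paper's implementation), each candidate thought can be evaluated several times under a stochastic model, e.g. with Monte Carlo Dropout, and thoughts whose value estimates vary widely can be penalized before the global search expands them; all names below are hypothetical:

```python
import statistics

def uncertainty_score(value_samples, penalty=1.0):
    """Score one candidate thought from repeated value estimates.

    value_samples: values assigned to the same thought across multiple
    stochastic evaluations. High variance signals local uncertainty,
    so it is subtracted from the mean value.
    """
    mean = statistics.mean(value_samples)
    std = statistics.pstdev(value_samples)
    return mean - penalty * std

def rank_thoughts(thought_values, penalty=1.0, keep=2):
    """Keep the top-`keep` thoughts by uncertainty-penalized value."""
    scored = sorted(
        thought_values.items(),
        key=lambda kv: uncertainty_score(kv[1], penalty),
        reverse=True,
    )
    return [thought for thought, _ in scored[:keep]]
```

A consistently mediocre thought can thus outrank one with a higher but noisier mean value, which is the intuition behind folding local uncertainty into the global search.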
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Continual learning is a challenging problem in which models need to be
trained on non-stationary data across sequential tasks for class-incremental
learning. While previous methods have focused on using either regularization or
rehearsal-based frameworks to alleviate catastrophic forgetting in image
classification, they are limited to a single modality and cannot learn compact
class-aware cross-modal representations for continual audio-visual learning. To
address this gap, we propose a novel class-incremental grouping network (CIGN)
that can learn category-wise semantic features to achieve continual
audio-visual learning. Our CIGN leverages learnable audio-visual class tokens
and audio-visual grouping to continually aggregate class-aware features.
Additionally, it utilizes class tokens distillation and continual grouping to
prevent forgetting parameters learned from previous tasks, thereby improving
the model's ability to capture discriminative audio-visual categories. We
conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and
VGG-Sound Sources benchmarks. Our experimental results demonstrate that the
CIGN achieves state-of-the-art audio-visual class-incremental learning
performance. Code is available at https://github.com/stoneMo/CIGN.
Comment: ICCV 2023.
Exploring Data Augmentations on Self-/Semi-/Fully- Supervised Pre-trained Models
Data augmentation has become a standard component of vision pre-trained
models to capture the invariance between augmented views. In practice,
augmentation techniques that mask regions of a sample with zero/mean values or
patches from other samples are commonly employed in pre-trained models with
self-/semi-/fully-supervised contrastive losses. However, the underlying
mechanism behind the effectiveness of these augmentation techniques remains
poorly explored. To investigate this, we conduct an empirical study to
quantify how data augmentation affects performance. Concretely, we apply four
data augmentation techniques, namely Random Erasing, CutOut, CutMix, and
MixUp, to a series of self-/semi-/fully-supervised pre-trained models. We
report their performance on vision tasks such as image classification, object
detection, instance segmentation, and semantic segmentation. We then explicitly
evaluate the invariance and diversity of the feature embedding. We observe
that: 1) Masking regions of the images decreases the invariance of the learned
feature embedding while considerably increasing its diversity. 2) Manual
annotations do not change the invariance or diversity of the learned feature
embedding. 3) The MixUp approach improves diversity significantly, with
only a marginal decrease in invariance.
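For reference, two of the masking/mixing augmentations studied above can be sketched in a few lines of NumPy; this is a generic illustration of the techniques, not the study's exact configuration:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: a convex combination of two samples and their labels,
    with the mixing weight drawn from a Beta(alpha, alpha) prior."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def cutout(image, size=8, rng=None):
    """CutOut: zero out a random square patch of an (H, W, C) image."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    out = image.copy()
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out[y0:y1, x0:x1] = 0.0
    return out
```

Random Erasing is a close relative of CutOut (the patch location, aspect ratio, and fill value are randomized), and CutMix replaces the zeroed patch with a patch from another sample while mixing the labels in proportion to the patch area.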
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Visual and linguistic pre-training aims to learn vision and language
representations together, which can be transferred to visual-linguistic
downstream tasks. However, there exists semantic confusion between language and
vision during the pre-training stage. Moreover, current pre-trained models tend
to take lots of computation resources for fine-tuning when transferred to
downstream tasks. In this work, we present a simple but effective approach for
learning Contrastive and Adaptive representations of Vision and Language,
namely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn
alignments between the whole sentence and each image in the same batch during
the pre-training process. At the fine-tuning stage, we introduce two
lightweight adaptation networks to reduce model parameters and increase
training speed for saving computation resources. We evaluate our CAVL on six
main downstream tasks, including Visual Question Answering (VQA), Visual
Commonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR),
Region-to-Phrase Grounding (RPG), Text-to-Image Retrieval (TIR), and Zero-shot
Text-to-Image Retrieval (ZS-TIR). Compared to baselines, we achieve superior
performance and reduce the fine-tuning time by a large margin (in particular,
76.17%). Extensive experiments and ablation studies demonstrate the efficiency
of contrastive pre-training and adaptive fine-tuning proposed in our CAVL.
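The pair-wise contrastive objective described above is, in spirit, an in-batch InfoNCE loss between sentence and image embeddings: matched pairs on the diagonal of the batch similarity matrix are pulled together, all other in-batch pairs are pushed apart. A minimal NumPy sketch follows, assuming L2-normalized embeddings and a temperature; this is not CAVL's exact formulation:

```python
import numpy as np

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric in-batch InfoNCE over matched (text, image) pairs.

    text_emb, image_emb: (B, D) arrays; row i of each is a matched pair.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # (B, B) similarity matrix

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Averaging the two directions makes the loss symmetric, so neither modality is privileged as the "query" side during pre-training.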
Audio-Visual Class-Incremental Learning
In this paper, we introduce audio-visual class-incremental learning, a
class-incremental learning scenario for audio-visual video recognition. We
demonstrate that joint audio-visual modeling can improve class-incremental
learning, but current methods fail to preserve semantic similarity between
audio and visual features as the number of incremental steps grows. Furthermore, we observe
that audio-visual correlations learned in previous tasks can be forgotten as
incremental steps progress, leading to poor performance. To overcome these
challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity
Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic
similarity between audio-visual modalities and Visual Attention Distillation
(VAD) to retain previously learned audio-guided visual attentive ability. We
create three audio-visual class-incremental datasets, AVE-Class-Incremental
(AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and
VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and
VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and
VS100-CI demonstrate that AV-CIL significantly outperforms existing
class-incremental learning methods in audio-visual class-incremental learning.
Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
Comment: Accepted at ICCV 2023.
MultiIoT: Towards Large-scale Multisensory Learning for the Internet of Things
The Internet of Things (IoT), the network integrating billions of smart
physical devices embedded with sensors, software, and communication
technologies for the purpose of connecting and exchanging data with other
devices and systems, is a critical and rapidly expanding component of our
modern world. The IoT ecosystem provides a rich source of real-world modalities
such as motion, thermal, geolocation, imaging, depth, sensors, video, and audio
for prediction tasks involving the pose, gaze, activities, and gestures of
humans as well as the touch, contact, pose, and 3D structure of physical objects. Machine
learning presents a rich opportunity to automatically process IoT data at
scale, enabling efficient inference for impact in understanding human
wellbeing, controlling physical devices, and interconnecting smart cities. To
develop machine learning technologies for IoT, this paper proposes MultiIoT,
the most expansive IoT benchmark to date, encompassing over 1.15 million
samples from 12 modalities and 8 tasks. MultiIoT introduces unique challenges
involving (1) learning from many sensory modalities, (2) fine-grained
interactions across long temporal ranges, and (3) extreme heterogeneity due to
unique structure and noise topologies in real-world sensors. We also release a
set of strong modeling baselines, spanning modality- and task-specific methods
as well as multisensory and multitask models, to encourage future research in
multisensory representation learning for IoT.